Unstable markup: A template-based information extraction from web sites with unstable markup

Authors

  • Maxim Kolchin
  • Fedor Kozlov
Abstract

This paper presents the results of crawling the CEUR Workshop Proceedings web site into a Linked Open Data (LOD) dataset within the framework of the Semantic Publishing Challenge 2014. Our approach is based on so-called "templates of web site blocks" for crawling and on DBpedia for linking extracted entities.
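
The abstract does not describe how the templates are implemented, so the following is only a minimal sketch of the general idea, under stated assumptions: each template describes one markup variant of a page block, and the extractor tries the templates in order until one matches every field. The regular expressions, field names, sample markup, and the `dbpedia_candidate` helper are all hypothetical and are not taken from the paper; the helper only builds a naive candidate URI and performs no lookup.

```python
import re
from typing import Optional
from urllib.parse import quote

# Hypothetical templates: each maps a field name to a regex for ONE markup variant.
# These patterns are illustrative assumptions, not the paper's actual templates.
TEMPLATES = [
    # Variant A: fields wrapped in class-annotated spans.
    {
        "title": re.compile(r'<span class="title">(.*?)</span>', re.S),
        "editors": re.compile(r'<span class="editor">(.*?)</span>', re.S),
    },
    # Variant B: older pages with a bare heading and an "Edited by" label.
    {
        "title": re.compile(r"<h1>(.*?)</h1>", re.S),
        "editors": re.compile(r"<b>Edited by</b>\s*(.*?)<", re.S),
    },
]


def extract(html: str) -> Optional[dict]:
    """Return the fields from the first template whose patterns all match."""
    for template in TEMPLATES:
        record = {}
        for field, pattern in template.items():
            matches = pattern.findall(html)
            if not matches:
                break  # this template does not fit the page; try the next one
            record[field] = [m.strip() for m in matches]
        else:
            return record  # every field of this template matched
    return None  # no template fits this markup variant


def dbpedia_candidate(label: str) -> str:
    """Build a naive candidate DBpedia resource URI from a label (no lookup is done)."""
    return "http://dbpedia.org/resource/" + quote(label.strip().replace(" ", "_"))
```

Supporting a new markup variant in this sketch means appending one more dictionary to `TEMPLATES`, which is the sense in which a template-driven extractor can stay extensible when a site's markup changes over time.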

Similar resources

A Template-Based Information Extraction from Web Sites with Unstable Markup

This paper presents the results of crawling the CEUR Workshop Proceedings web site into a Linked Open Data (LOD) dataset within the framework of the ESWC 2014 Semantic Publishing Challenge. Our approach is based on an extensible template-dependent crawler and on DBpedia for linking extracted entities, such as the names of universities and countries.

Grammatical inference for information extraction and visualisation on the Web

The world-wide web contains a wealth of database-style information scattered across different sites that could be better used if it were integrated into a single view. Since document formats vary widely between sites and frequently mingle structural with presentation markup, extracting and integrating data from web pages is a difficult challenge. Manually writing extraction wrappers is expensiv...

iDocument: Using Ontologies for Extracting and Annotating Information from Unstructured Text

Due to the huge amount of text data in the WWW, annotating unstructured text with semantic markup is a crucial topic in Semantic Web research. This work formally analyzes the incorporation of domain ontologies into information extraction tasks in iDocument. Ontology-based information extraction exploits domain ontologies with formalized and structured domain knowledge for extracting domain-relev...

AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages

The DARPA Agent Markup Language (DAML) is an emerging knowledge representation for the Semantic Web. DAML can encode the semantics of a document for use by agents on the web. However, DAML annotation of documents and web pages is a tedious and time consuming task. AeroDAML is a knowledge markup tool that applies natural language information extraction techniques to automatically generate DAML a...

KnowMore - Knowledge Base Augmentation with Structured Web Markup

Knowledge bases are in widespread use for aiding tasks such as information extraction and information retrieval, for example in Web search. However, knowledge bases are known to be inherently incomplete, where in particular tail entities and properties are under-represented. As a complementary data source, embedded entity markup based on Microdata, RDFa, and Microformats have become prevalent o...

Publication date: 2014